This project explores a dataset of white wines in which each wine's chemical properties are listed alongside a quality rank, the median grade given by professional tasters. The objective is to find out whether there is a clear relationship between the perceived quality of a wine and its chemical properties.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The white wines table has 13 variables and 4898 observations.
All input variables (those based on physicochemical tests) are numerical.
The output variable, quality, is an integer.
The X variable is the table index. It is not useful and may pollute our analysis with unnecessary plots, so we will drop it.
ww <- subset(ww, select = -X)
After dropping the X variable, 12 variables remain: wine quality and the 11 chemical properties, all summarized below.
Output variable (based on qualitative tests):
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The wine quality histogram has a shape resembling a normal curve, with most wines graded between 5 and 7. No wine in our dataset got a quality grade worse than 3 or better than 9.
The fixed acidity shows a normal-like distribution, with most values ranging from 5 to 9 g/dm3.
Volatile acidity is right skewed (a long tail toward high values), with most amounts ranging from 0.15 to 0.40 g/dm3. A log10 transformation helped us see the distribution better.
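The effect of the log10 transformation can be sketched in base R. The report's own plotting code is not shown, so this is only an illustration under assumptions: `va` is a synthetic, log-normally distributed stand-in for volatile acidity, not the wine data.

```r
# Synthetic, right-skewed stand-in for volatile acidity (NOT the wine data).
set.seed(42)
va <- rlnorm(1000, meanlog = log(0.26), sdlog = 0.35)

# A long right tail pulls the mean above the median.
c(mean = mean(va), median = median(va))

# Histograms on the raw scale and on the log10 scale, side by side.
op <- par(mfrow = c(1, 2))
hist(va, breaks = 40, main = "Raw scale",
     xlab = "volatile acidity (g/dm3)")
hist(log10(va), breaks = 40, main = "log10 scale",
     xlab = "log10(volatile acidity)")
par(op)
```

On the log10 scale the long tail is compressed, which makes the bulk of the distribution easier to read.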
Citric acid is found in small quantities, with most wines ranging between 0 and 0.8 g/dm3. Its curve has a normal-like shape with a few extreme outliers above 0.8 g/dm3.
Residual sugar is right skewed, with the most common amounts between 1 and 3 g/dm3. We again benefited from a log10 transformation for better visualization but, contrary to what we expected, it did not return a bell shape but rather a bimodal one.
The majority of chloride values are concentrated between 0.03 and 0.10 g / dm3 in a normal-like distribution, with some extreme outliers. These outliers prevent us from clearly seeing the finer distribution, which was again revealed through a log10 transformation.
Free sulfur dioxide has a few wild outliers, which prevent us from properly seeing the distribution. For this reason, the x axis is shown on a log10 scale. The vast majority of values range from 15 to 70 mg / dm3.
Most white wines have a total sulfur dioxide between 60 and 220 mg / dm3, fairly normally distributed around 130 to 140.
As stated before, the density of wine is close to that of water, depending on the percent alcohol and sugar content. This variable should therefore be highly correlated to those two variables. Most values are in a narrow range from 0.990 to 0.999 g / cm3.
The pH scale ranges from 0 (very acidic) to 14 (very basic), but all white wines fall in a narrow, normal-like distribution from 2.7 to 3.8.
Sulphates (potassium sulphate) are a wine additive which acts as an antimicrobial and antioxidant agent. Most wines have a concentration between 0.4 and 0.6 g / dm3.
Alcohol content has a relatively wide range: most wines contain from 8.7% to 13%.
Nothing can be said about it so far, except that fewer wines reach the higher alcohol concentrations.
Based on our preliminary examination of individual variables and their value distributions, we noticed that most variables are either normal-like distributed or right skewed. Residual sugar deserves special attention, as it has a bimodal distribution.
No data cleansing or any other form of data transformation has been performed so far: outliers were kept in the database and no new variables were created.
Correlation measures fall between -1 and 1: values close to -1 indicate a strong negative correlation and values close to +1 a strong positive one.
x <= -0.9 | x >= +0.9 --> very strong correlation
-0.9 < x <= -0.7 | +0.9 > x >= +0.7 --> strong correlation
-0.7 < x <= -0.5 | +0.7 > x >= +0.5 --> moderate correlation
-0.5 < x <= -0.3 | +0.5 > x >= +0.3 --> weak correlation
-0.3 < x < +0.3 --> negligible correlation
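The bands above can be written as a small helper. This is a minimal sketch, not code from the report: the function name `classify_correlation` is hypothetical, and it reads the last band as "strictly between -0.3 and +0.3".

```r
# Hypothetical helper: map correlation coefficients to the strength bands.
classify_correlation <- function(r) {
  stopifnot(all(abs(r) <= 1, na.rm = TRUE))
  # right = FALSE makes each band closed on the left: [0.3, 0.5), [0.5, 0.7), ...
  cut(abs(r),
      breaks = c(0, 0.3, 0.5, 0.7, 0.9, Inf),
      labels = c("negligible", "weak", "moderate", "strong", "very strong"),
      right = FALSE)
}

# The sign is ignored; only the magnitude decides the band.
classify_correlation(c(-0.8, 0.4, -0.2, 0.9))
```

Working on `abs(r)` keeps the negative and positive halves of each band in a single interval.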
The correlation matrix above shows that no variable has even a moderate direct correlation to quality. Alcohol content comes closest, with a weak correlation of 0.4.
Strong Correlations:
1. Negative correlation between alcohol and density (-0.8);
2. Positive correlation between residual sugar and density (+0.8).
Moderate Correlations:
1. Negative correlation between alcohol and residual sugar (-0.5);
2. Negative correlation between alcohol and chlorides (-0.4, which falls in the weak band of the scale above);
3. Positive correlation between total sulfur dioxide and density (+0.5);
4. Positive correlation between total sulfur dioxide and free sulfur dioxide (+0.6).
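A list like the one above can be pulled out of a correlation matrix programmatically. The sketch below uses base R's `cor()` on a synthetic data frame named `toy`. The values and coefficients are invented for illustration and are not the wine data; only the direction of the density relation mimics the report.

```r
# Synthetic stand-in data (NOT the wine dataset).
set.seed(1)
n <- 200
alcohol        <- rnorm(n, mean = 10.5, sd = 1.2)
residual.sugar <- pmax(0.6, rnorm(n, mean = 6, sd = 4))
# Assumed toy rule: density falls with alcohol and rises with sugar.
density <- 0.994 - 0.001 * (alcohol - 10.5) +
  0.0004 * (residual.sugar - 6) + rnorm(n, sd = 0.0003)
toy <- data.frame(alcohol, residual.sugar, density)

# Pearson correlation matrix of all numeric columns.
cors <- cor(toy)

# Keep each pair once (upper triangle) when |r| is at least moderate.
idx <- which(upper.tri(cors) & abs(cors) >= 0.5, arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = round(cors[idx], 2)
)
pairs
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal of ones.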
High levels of acetic acid can lead to an unpleasant, vinegar taste.
Levels of volatile acidity above 0.36 g / dm3 are rare for an above-average wine.
Above 0.5 g / dm3, the wine is almost certainly a bad one.
We can't say much about citric acid concentration, except that the highest quality wines (very few individual cases) show a higher concentration.
We can't say much about residual sugar concentration. It could be associated with its conversion to alcohol, which is a sign of a high quality wine. But it could also mean the grapes had a lot more sugar at the beginning of the process, leading to no conclusion at all.
We should be very careful with this variable and, if possible, leave it out of our analysis.
There's a clear median trend showing that the lower the chloride level, the better the wine. But this trend is not corroborated by the correlation (-0.2), probably because of the overlapping interquartile ranges and wide variability.
For total sulfur dioxide, there's a clear convergence of the best wines towards a range between 100 and 150 mg / dm3. The tendency also favors low concentrations over high ones.
There's a tendency showing that the lower the density, the better. Density is highly correlated to alcohol content.
There’s not a clear tendency between pH and quality.
Nothing can be said about sulphate concentration.
Alcohol shows a tendency between its concentration and quality: usually, the higher the alcohol content, the better.
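The per-grade comparisons above rest on box plots and group medians. A minimal base R sketch follows. The `toy` data frame is a hand-made illustration, not the wine data, and the plotting library used by the report itself is not shown.

```r
# Hand-made illustration data (NOT the wine dataset).
toy <- data.frame(
  quality = c(4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8),
  alcohol = c(9.0, 9.4, 9.2, 9.6, 10.0, 10.1, 10.5, 10.9, 11.2, 11.8, 12.4)
)

# Median alcohol content per quality grade (the bold line in each box).
med <- tapply(toy$alcohol, toy$quality, median)
med

# One box per grade; overlapping boxes signal a weak separation.
boxplot(alcohol ~ quality, data = toy,
        xlab = "quality grade", ylab = "alcohol (% by volume)")

# The single-number summary to compare against the visual trend.
cor(toy$alcohol, toy$quality)
```

Comparing the medians with the correlation coefficient is one way to see why a visible median trend can coexist with a weak correlation when the boxes overlap.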
So far we have empirically observed that a good wine should have:
A quick web search confirmed that densitometry is a known method for determining wine alcohol content.
Source: The Australian Wine Research Institute.
As density is strongly (inversely) correlated to alcohol content, it will be dropped from further analysis.
A new variable, grouping the wines into low (3-5), mid (6) and high (7-9) quality, will be created. The mid group, represented by quality grade 6, is the median and the mode (the mean, 5.88, rounds to it as well), and on its own it holds more wines than either of the other two groups. For this reason it gets a group of its own.
ww$quality.cut <- cut(ww$quality, c(2, 5, 6, 9), labels = c("Low Quality", "Mid Quality", "High Quality"), ordered_result = TRUE)
summary(ww$quality.cut)
## Low Quality Mid Quality High Quality
## 1640 2198 1060
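To see how the break points `c(2, 5, 6, 9)` map each grade to a group, the same `cut()` call can be applied to the grades 3 to 9. This is a small check on a stand-alone vector, not on the `ww` data frame.

```r
# Every grade present in the dataset, from worst (3) to best (9).
grades <- 3:9

# cut() builds right-closed intervals by default: (2,5], (5,6], (6,9].
groups <- cut(grades, breaks = c(2, 5, 6, 9),
              labels = c("Low Quality", "Mid Quality", "High Quality"),
              ordered_result = TRUE)

data.frame(grade = grades, group = groups)
```

Because the intervals are right-closed, grade 5 lands in the low group and grade 6 forms the mid group by itself.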
The focus here goes to the scatterplots (left, below diagonal). The intention is to find separate groups by quality (colors) in the intersection between two other variables.
The ones found were:
1. Alcohol x volatile acidity
2. Alcohol x pH
It means volatile acidity and pH need to be analyzed for an indirect effect on quality.
At this point we will analyze the relationship between the relevant variables among themselves and also quality.
To make it simpler, a new dataframe containing this subset will be created.
ww_sub <- subset(ww, select = c(volatile.acidity, total.sulfur.dioxide, pH, alcohol, quality.cut))
After subsetting, the relations became easier to see with the naked eye.
Note: volatile acidity has a skewed distribution and benefits from a log10 transformation for better visualization.
Through the scatterplot we see a higher concentration of low quality wines (red dots) at lower alcohol content, and vice versa. As seen before while analysing the box plots, alcohol is a desired characteristic.
At every alcohol level, higher volatile acidity is associated with lower wine quality. It is not a desired characteristic.
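The statement "at every alcohol level" can also be checked numerically by binning both variables and averaging quality in each cell. The sketch below is only an illustration under assumptions: the data are synthetic and generated from an invented rule, and the bin edges are arbitrary choices, not values taken from the report.

```r
# Synthetic stand-in data (NOT the wine dataset).
set.seed(7)
n <- 600
alcohol <- runif(n, 8.5, 13)
va      <- runif(n, 0.1, 0.6)
# Assumed toy rule: quality rises with alcohol and falls with volatile acidity.
quality <- round(6 + 0.5 * (alcohol - 10.5) - 4 * (va - 0.3) +
                   rnorm(n, sd = 0.5))

# Arbitrary illustrative bins for both variables.
alcohol_bin <- cut(alcohol, breaks = c(8, 10, 11.5, 13),
                   labels = c("low alcohol", "mid alcohol", "high alcohol"))
va_bin <- cut(va, breaks = c(0, 0.25, 0.4, 0.7),
              labels = c("low VA", "mid VA", "high VA"))

# Mean quality per (alcohol bin, volatile acidity bin) cell.
mean_quality <- tapply(quality,
                       list(alcohol = alcohol_bin, volatile.acidity = va_bin),
                       mean)
round(mean_quality, 2)
```

Reading the table row by row shows whether mean quality drops as volatile acidity rises within a fixed alcohol band, which is the conditional pattern the scatterplot suggests.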
In most cases, higher pH is preferred over low pH, but it’s not a general rule. Low alcoholic white wines get good grades when associated with lower pH (higher acidity).
Nothing can be said about this relation.
In most cases, lower levels of total sulfur dioxide are preferred over high levels.
This relation could not be perceived before. It’s an indirect effect on quality.
Observing the most relevant features through multivariate scatterplots, it was possible to closely analyze what was empirically observed through bivariate analysis and the multivariate matrix:
The general desired features in a white wine are:
1. High alcohol content %
2. Low volatile acidity level
3. Low Total Sulfur Dioxide level
Alcohol content is, alone, the most relevant feature to explain wine quality. The correlation is clear just by seeing the boxplot, with its steep curve and small range among the highest quality wines (grade 9). When calculated, it showed a 0.4 correlation to quality. Still weak, but the highest among all variables.
It's also clear that something other than alcohol must have gone really wrong with the lowest quality wines (mostly grades 3 and 4). Despite a rise in alcohol content relative to slightly better wines, the final result was awful.
The variable description already states that high levels of volatile acidity (acetic acid) can lead to an unpleasant, vinegar taste.
As a standalone feature, volatile acidity influence can only be perceived in very low quality wines. But seen in conjunction with alcohol content, it’s clear that high volatile acidity is not a desired feature at all.
Total sulfur dioxide (SO2) becomes evident in the nose and taste of wine above 50 ppm, a level exceeded by more than 99% of all analyzed wines. Seen apart from other variables, its effect on quality is inconclusive. But seen in conjunction with volatile acidity, it becomes clear that it is not desired.
Putting it all together, higher alcohol concentration is better than lower concentration. Whatever the alcohol level, lower volatile acidity gives us a better wine. And whatever the volatile acidity level, lower total sulfur dioxide concentration is preferred. These three variables combined most likely give a good white wine.
Based on this exploratory data analysis (EDA), it was possible not only to get a first impression of the dataset, the value ranges of its variables and the relations between them, but also to get a first grasp of the effects of chemical properties on quality.
The analysis also made clear that not every effect can be spotted directly: log transformations and limited value ranges were needed to keep extreme outliers from taking up most of the graph space. Indirect relations were also needed to spot important features. Take total sulfur dioxide (SO2) as an example: first came the relation between alcohol content and quality; second, the effect of volatile acidity at every alcohol content; last, the effect of SO2 at every volatile acidity range.
The current analysis can be extended to the properties of the other variables by using a statistical model such as a decision tree. The dataset could also be used to classify new, unknown wines using machine learning techniques. For now, only the most obvious relations were taken into consideration.
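As a rough idea of what the decision tree follow-up might look like, the sketch below fits a classification tree with the `rpart` package, which ships with standard R distributions. It is not part of the original analysis: the `toy` data frame is synthetic, built from an invented rule, and the thresholds used to form the quality groups are arbitrary.

```r
library(rpart)

# Synthetic stand-in data (NOT the wine dataset).
set.seed(11)
n <- 500
toy <- data.frame(
  alcohol          = runif(n, 8.5, 13),
  volatile.acidity = runif(n, 0.1, 0.6)
)
# Assumed toy rule linking the two features to a latent quality score.
score <- 6 + 0.6 * (toy$alcohol - 10.5) -
  4 * (toy$volatile.acidity - 0.3) + rnorm(n, sd = 0.5)
toy$quality.cut <- cut(score, breaks = c(-Inf, 5.5, 6.5, Inf),
                       labels = c("Low Quality", "Mid Quality", "High Quality"))

# Classification tree predicting the quality group from both features.
fit <- rpart(quality.cut ~ alcohol + volatile.acidity,
             data = toy, method = "class")

# Predictions on the training data only; a real study needs a held-out set.
pred <- predict(fit, newdata = toy, type = "class")
table(actual = toy$quality.cut, predicted = pred)
```

With the real data, the same call would take the `ww` data frame and the `quality.cut` variable created earlier, and the resulting splits could be compared with the thresholds observed in the plots.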
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
Australian Wine Research Institute. Website. 08 Oct. 2017. https://www.awri.com.au/industry_support/winemaking_resources/laboratory_methods/chemical/alcohol/.